Using Machine Learning and Company Fundamentals for Beating the Stock Market¶

A Practical Guide on Building a Machine Learning Model for Stock Trading¶

In this post, I will show how to build a simple Machine Learning classifier that can help investors choose which stocks to invest in.

This article, although introductory, is targeted at people with some Python and ML knowledge. If you are interested in a higher-level explanation of ML and Trading, check out this post.

The following topics will be covered:

  • Data preparation - target and feature creation (Pandas)
  • Exploratory Data Analysis (Pandas Profiling) 
  • Modeling (LightGBM)
  • Evaluation - Classifier, and Portfolio (visualization with Plotly)

The data and code used in this article are available on my GitHub page. Feel free to clone the project to follow along.

In the repository, you will also find a README.md explaining how to set up an environment with all the necessary dependencies.

In [1]:
# load lab_black for easy code formatting
%load_ext lab_black
In [ ]:
 

💽 Data¶

To make it easier to follow along, I have created a sample dataset containing historical data for the 30 Dow Jones constituents. The dataset has already been aggregated at the month level, and some financial ratios have been pre-computed.

The raw data comes from Tiingo, a company that offers financial data APIs.

In [ ]:
 

🔃 Load the Dataset¶

Let's load the dataset into Pandas.

Here is the meaning of the columns:

  • ticker: ticker identifier of the stock
  • date: date (end of the month)
  • adjOpen: opening price (adjusted) on the first day of the month
  • adjClose: closing price (adjusted) on the last day of the month
  • price_rate_of_change_1M: the return of the stock during the last month
  • price_rate_of_change_3M: the return of the stock during the last 3 months
  • epsDil: earnings per share diluted
  • return_on_assets: return on assets ratio
  • return_on_equity: return on equity ratio
  • price_to_earnings_ratio: p/e ratio
  • debt_to_equity_ratio: d/e ratio
In [2]:
import pandas as pd

pd.options.mode.chained_assignment = None

# read dataset
df = pd.read_csv("data/dataset.csv", parse_dates=["date"])

# format index
df = df.set_index(["ticker", "date"])

# display data
df
Out[2]:
adjOpen adjClose price_rate_of_change_1M price_rate_of_change_3M epsDil return_on_assets return_on_equity price_to_earnings_ratio debt_to_equity_ratio
ticker date
AAPL 2000-01-31 0.801664 0.793102 0.009143 0.294933 0.006 0.021507 0.035760 132.183691 0.662693
2000-02-29 0.795013 0.876196 0.104771 0.171145 0.009 0.024123 0.041459 97.355147 0.718623
2000-03-31 0.906315 1.038180 0.184872 0.320980 0.009 0.024123 0.041459 115.353363 0.718623
2000-04-30 1.035811 0.948359 -0.086518 0.195759 0.009 0.024123 0.041459 105.373229 0.718623
2000-05-31 0.954551 0.642126 -0.322908 -0.267144 0.011 0.033252 0.055279 58.375098 0.662396
... ... ... ... ... ... ... ... ... ... ...
WMT 2020-08-31 126.402983 135.654958 0.077424 0.123800 1.400 0.017132 0.058470 96.896398 2.326817
2020-09-30 137.950883 136.690566 0.007634 0.172842 2.270 0.027281 0.085991 60.216109 2.073895
2020-10-31 137.560087 135.557259 -0.008291 0.076648 2.270 0.027281 0.085991 59.716854 2.073895
2020-11-30 137.354919 149.274188 0.101189 0.100396 2.270 0.027281 0.085991 65.759554 2.073895
2020-12-31 150.065549 141.350206 -0.053083 0.034089 1.800 0.020469 0.063060 78.527892 2.006103

7160 rows × 9 columns

In [ ]:
 

🎯 Target Creation¶

We are going to build a classifier model, where the target will be a boolean variable indicating whether the stock has grown by at least X% within the month.

Before doing that, we take the following assumptions:

  • we buy on the first day of the month at the opening price (adjOpen)
  • we sell on the last day of the month at the closing price (adjClose)
  • the return is the difference between buying and selling prices (in %)

The target will then take the following values:

  • True if the stock's return is greater than or equal to X%
  • False otherwise

The target threshold X can be tuned through experiments or simulations on past data. Here we will keep it nice and simple and use a fixed threshold of 5%.
Feel free to experiment with other thresholds (10%, 20%…) or even build a more sophisticated target using take-profits and stop-losses, like the Triple Barrier method described by Marcos Lopez de Prado in Advances in Financial Machine Learning.
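As an aside, you can quickly check how the choice of threshold affects the class balance. A minimal sketch with a toy return series (the real values come from the dataset loaded above):

```python
import pandas as pd

# toy monthly returns, standing in for the real return column
returns = pd.Series([0.12, -0.03, 0.07, 0.18, -0.10, 0.02, 0.06, 0.25])

# share of "True" labels for a few candidate thresholds
for threshold in [0.05, 0.10, 0.20]:
    positive_rate = (returns >= threshold).mean()
    print(f"threshold={threshold:.0%} -> positive rate={positive_rate:.1%}")
```

Higher thresholds produce fewer positive labels, making the classification problem more imbalanced.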

Here is the code for creating the target.

In [3]:
# if the price increases by more than x%, we label it as "True" or "Buy"
threshold = 0.05  # 5%

# calculate the return within the month
df["return_month"] = (df["adjClose"] / df["adjOpen"]) - 1

# create the target
df["target"] = df["return_month"] >= threshold

# display data
df[["adjOpen", "adjClose", "return_month", "target"]]
Out[3]:
adjOpen adjClose return_month target
ticker date
AAPL 2000-01-31 0.801664 0.793102 -0.010680 False
2000-02-29 0.795013 0.876196 0.102115 True
2000-03-31 0.906315 1.038180 0.145496 True
2000-04-30 1.035811 0.948359 -0.084428 False
2000-05-31 0.954551 0.642126 -0.327300 False
... ... ... ... ... ...
WMT 2020-08-31 126.402983 135.654958 0.073194 True
2020-09-30 137.950883 136.690566 -0.009136 False
2020-10-31 137.560087 135.557259 -0.014560 False
2020-11-30 137.354919 149.274188 0.086777 True
2020-12-31 150.065549 141.350206 -0.058077 False

7160 rows × 4 columns

In [ ]:
 

🥗 Features Creation¶

To simplify the notebook, I have already pre-computed some features from the raw data. If you are interested in how I created them, let me know in the comments.

We will use the following features for building the model:

  • price_rate_of_change_1M
  • price_rate_of_change_3M
  • epsDil
  • return_on_assets
  • return_on_equity
  • price_to_earnings_ratio
  • debt_to_equity_ratio

A very important step here is to shift the value of the features by one period.

Why do we do this?

Because the actual values of those features are only known at the end of the month. We need to make sure that the input data (the features) used to predict the target is available at the beginning of the month. Shifting the features by one period guarantees that no data leakage occurs.

In [4]:
# list of features
features = [
    "price_rate_of_change_1M",
    "price_rate_of_change_3M",
    "epsDil",
    "return_on_assets",
    "return_on_equity",
    "price_to_earnings_ratio",
    "debt_to_equity_ratio",
]

# shift the value of the features by one period (make sure to use groupby!)
df[features] = df.groupby("ticker")[features].shift(1)

We then need to drop the first row of each ticker, to get rid of the NaN values created by the shift() operation.

In [5]:
# remove the first row for each ticker to get rid of the NaN created after doing the shift
df = df.loc[df.groupby("ticker").cumcount() > 0]
In [6]:
# display data
df[features + ["target"]]
Out[6]:
price_rate_of_change_1M price_rate_of_change_3M epsDil return_on_assets return_on_equity price_to_earnings_ratio debt_to_equity_ratio target
ticker date
AAPL 2000-02-29 0.009143 0.294933 0.006 0.021507 0.035760 132.183691 0.662693 True
2000-03-31 0.104771 0.171145 0.009 0.024123 0.041459 97.355147 0.718623 True
2000-04-30 0.184872 0.320980 0.009 0.024123 0.041459 115.353363 0.718623 False
2000-05-31 -0.086518 0.195759 0.009 0.024123 0.041459 105.373229 0.718623 False
2000-06-30 -0.322908 -0.267144 0.011 0.033252 0.055279 58.375098 0.662396 True
... ... ... ... ... ... ... ... ... ...
WMT 2020-08-31 0.080314 0.069299 1.400 0.017132 0.058470 89.933393 2.326817 True
2020-09-30 0.077424 0.123800 1.400 0.017132 0.058470 96.896398 2.326817 False
2020-10-31 0.007634 0.172842 2.270 0.027281 0.085991 60.216109 2.073895 False
2020-11-30 -0.008291 0.076648 2.270 0.027281 0.085991 59.716854 2.073895 True
2020-12-31 0.101189 0.100396 2.270 0.027281 0.085991 65.759554 2.073895 False

7130 rows × 8 columns

In [ ]:
 

🔍 Exploratory Data Analysis¶

To do a quick data exploration, we will use the Pandas Profiling library. It is great for building a profiling report in just one line of code, as shown below.

We get a nice interactive report; feel free to scroll through it in the notebook.

From the report, we can see that the data is quite clean overall.

It is important to notice that the target distribution is imbalanced, something to keep in mind during the modeling part!

We also see some outliers in the price_to_earnings_ratio and debt_to_equity_ratio columns. I am not going to dig deeper into this now, but it is something interesting to look at at a later stage: removing or fixing those outliers might improve the performance of the model. Data quality is very important for Machine Learning!
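The class balance can also be checked directly with Pandas, without generating the full report. A one-liner sketch (toy data standing in for the real df):

```python
import pandas as pd

# toy target column standing in for df["target"]
df = pd.DataFrame({"target": [True, False, False, True, False, False]})

# relative frequency of each class
print(df["target"].value_counts(normalize=True))
```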

In [7]:
from pandas_profiling import ProfileReport

profile = ProfileReport(df, title="Pandas Profiling Report", minimal=True)

profile.to_notebook_iframe()


In [ ]:
 

🤖 Modeling¶

Now that we have a dataset to work with, we are going to build a simple classifier model using LightGBM. 

LightGBM is a gradient boosting framework that uses tree-based learning algorithms.

I like LightGBM for its high accuracy, fast training, and, most importantly, its ease of use: you barely need any pre-processing to make a LightGBM model work. Categorical features and NaN values, for example, are handled automatically.

As it is tree-based, it is also better for the environment (😁) and easier to explain and visualize for non-technical people, which leads to more acceptance of the model by stakeholders.

In [ ]:
 

🪓 Splitting the Data¶

For the sake of simplicity, we will use a simple train/test split and leave more sophisticated cross-validation procedures aside. 
Data from 2020 will be used for testing, and data before that for training.

In [8]:
split_date = 2020

df_train = df.loc[df.index.get_level_values("date").year < split_date]
df_test = df.loc[df.index.get_level_values("date").year == split_date]
In [9]:
# show train data
print(
    df_train.index.get_level_values("date").min(),
    df_train.index.get_level_values("date").max(),
)
df_train
2000-02-29 00:00:00 2019-12-31 00:00:00
Out[9]:
adjOpen adjClose price_rate_of_change_1M price_rate_of_change_3M epsDil return_on_assets return_on_equity price_to_earnings_ratio debt_to_equity_ratio return_month target
ticker date
AAPL 2000-02-29 0.795013 0.876196 0.009143 0.294933 0.006 0.021507 0.035760 132.183691 0.662693 0.102115 True
2000-03-31 0.906315 1.038180 0.104771 0.171145 0.009 0.024123 0.041459 97.355147 0.718623 0.145496 True
2000-04-30 1.035811 0.948359 0.184872 0.320980 0.009 0.024123 0.041459 115.353363 0.718623 -0.084428 False
2000-05-31 0.954551 0.642126 -0.086518 0.195759 0.009 0.024123 0.041459 105.373229 0.718623 -0.327300 False
2000-06-30 0.624926 0.800823 -0.322908 -0.267144 0.011 0.033252 0.055279 58.375098 0.662396 0.281468 True
... ... ... ... ... ... ... ... ... ... ... ... ...
WMT 2019-08-31 105.399599 109.697015 -0.000996 0.079033 1.330 0.016381 0.056330 79.290920 2.340503 0.040773 False
2019-09-30 109.140178 113.940502 0.040207 0.131881 1.330 0.016381 0.056330 82.478959 2.340503 0.043983 False
2019-10-31 114.103713 112.577210 0.038684 0.079370 1.260 0.015371 0.051332 90.428970 2.242809 -0.013378 False
2019-11-30 113.210853 114.334129 -0.011965 0.067518 1.260 0.015371 0.051332 89.346992 2.242809 0.009922 False
2019-12-31 114.391733 114.603719 0.015606 0.042272 1.260 0.015371 0.051332 90.741372 2.242809 0.001853 False

6770 rows × 11 columns

In [10]:
# show test data
print(
    df_test.index.get_level_values("date").min(),
    df_test.index.get_level_values("date").max(),
)
df_test
2020-01-31 00:00:00 2020-12-31 00:00:00
Out[10]:
adjOpen adjClose price_rate_of_change_1M price_rate_of_change_3M epsDil return_on_assets return_on_equity price_to_earnings_ratio debt_to_equity_ratio return_month target
ticker date
AAPL 2020-01-31 72.882172 76.146912 0.098784 0.315005 0.758 0.040429 0.151247 95.309987 2.741004 0.044795 False
2020-02-29 74.865126 67.414954 0.054010 0.247904 1.248 0.065281 0.248361 61.015154 2.804470 -0.099515 False
2020-03-31 69.614769 62.711987 -0.114673 0.025324 1.248 0.065281 0.248361 54.018393 2.804470 -0.099157 False
2020-04-30 60.790848 72.455785 -0.069761 -0.131954 1.248 0.065281 0.248361 50.249989 2.804470 0.191886 True
2020-05-31 70.593835 78.616414 0.155374 -0.048474 1.248 0.065281 0.248361 58.057520 2.804470 0.113644 True
... ... ... ... ... ... ... ... ... ... ... ... ...
WMT 2020-08-31 126.402983 135.654958 0.080314 0.069299 1.400 0.017132 0.058470 89.933393 2.326817 0.073194 True
2020-09-30 137.950883 136.690566 0.077424 0.123800 1.400 0.017132 0.058470 96.896398 2.326817 -0.009136 False
2020-10-31 137.560087 135.557259 0.007634 0.172842 2.270 0.027281 0.085991 60.216109 2.073895 -0.014560 False
2020-11-30 137.354919 149.274188 -0.008291 0.076648 2.270 0.027281 0.085991 59.716854 2.073895 0.086777 True
2020-12-31 150.065549 141.350206 0.101189 0.100396 2.270 0.027281 0.085991 65.759554 2.073895 -0.058077 False

360 rows × 11 columns

In [ ]:
 

👔 Fitting the Classifier¶

Let's create a classifier estimator and fit it on the train data. I am using the default hyperparameters, except for is_unbalance, which is set to True (given the high class imbalance of the dataset), and max_depth, num_leaves, and min_child_samples, which are set to "appropriate" values according to the LightGBM documentation.

Feel free to experiment with other hyperparameters!

In [11]:
from lightgbm import LGBMClassifier

# define classifier
estimator = LGBMClassifier(
    is_unbalance=True,
    max_depth=4,
    num_leaves=8,
    min_child_samples=400,
    n_estimators=50,
)

# fit classifier on training data
estimator.fit(df_train[features], df_train["target"])
Out[11]:
LGBMClassifier(is_unbalance=True, max_depth=4, min_child_samples=400,
               n_estimators=50, num_leaves=8)
In [ ]:
 

🔮 Predicting on the Test Data¶

Once the model has been fitted on the training data, we can use it to make predictions on the test data. A new column, buy, is created in df_test; it contains the predictions made by the model.

In [12]:
# make prediction using test data
df_test["buy"] = estimator.predict(df_test[features])

# display data
df_test[["return_month", "target", "buy"]]
Out[12]:
return_month target buy
ticker date
AAPL 2020-01-31 0.044795 False False
2020-02-29 -0.099515 False False
2020-03-31 -0.099157 False False
2020-04-30 0.191886 True True
2020-05-31 0.113644 True False
... ... ... ... ...
WMT 2020-08-31 0.073194 True False
2020-09-30 -0.009136 False False
2020-10-31 -0.014560 False False
2020-11-30 0.086777 True False
2020-12-31 -0.058077 False False

360 rows × 3 columns

In [13]:
# display only the stocks with buy=True
df_test.loc[df_test["buy"] == True][["return_month", "target", "buy"]]
Out[13]:
return_month target buy
ticker date
AAPL 2020-04-30 0.191886 True True
AMGN 2020-02-29 -0.070369 False True
2020-03-31 0.014157 False True
2020-06-30 0.030451 False True
2020-11-30 0.008922 False True
... ... ... ... ...
WBA 2020-11-30 0.114243 True True
2020-12-31 0.039021 False True
WMT 2020-02-29 -0.062837 False True
2020-03-31 0.060722 True True
2020-07-31 0.083298 True True

140 rows × 3 columns

Now that we have predictions on the test set, we can move on to the evaluation part, where we are going to assess the performance of the model.

In [ ]:
 

✅ Evaluating the Model¶

We are going to evaluate the performance of the model in two ways:

  • Classifier performance: using metrics such as accuracy, precision, recall, etc. We can assess how good the classifier is at distinguishing good vs. bad performing stocks.
  • Portfolio performance: using backtesting, we can simulate how much money we would have made (or not!), and compute financial metrics like total return, Sharpe ratio, drawdown, etc.

📝 Classifier Performance¶

To get an overall idea of the performance of the classifier, we will use the classification_report from sklearn.

The overall accuracy is 61%. The model does a fairly good job at predicting the False class (73% precision, 66% recall) but is less good at predicting the True class (42% precision, 50% recall).

We should be careful with accuracy here, given the high class imbalance of the dataset.

In [14]:
from sklearn.metrics import classification_report

print(classification_report(df_test["target"], df_test["buy"]))
              precision    recall  f1-score   support

       False       0.73      0.66      0.69       241
        True       0.42      0.50      0.46       119

    accuracy                           0.61       360
   macro avg       0.57      0.58      0.57       360
weighted avg       0.63      0.61      0.62       360
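Given the imbalance caveat, it can help to look at the raw counts behind these scores, plus a class-averaged metric. A sketch with toy labels (the notebook would pass df_test["target"] and df_test["buy"] instead):

```python
from sklearn.metrics import balanced_accuracy_score, confusion_matrix

# toy stand-ins for df_test["target"] (actual) and df_test["buy"] (predicted)
y_true = [False, False, False, True, True, False]
y_pred = [False, True, False, True, False, False]

# rows are actual classes, columns predicted: [[TN, FP], [FN, TP]]
print(confusion_matrix(y_true, y_pred))

# accuracy averaged over the classes, more robust to imbalance
print(balanced_accuracy_score(y_true, y_pred))
```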

In [ ]:
 

📈 Portfolio Performance¶

Let's now focus more on the financial performance of the model. That is, would we have been able to make money with this model?

To do that, we take the following assumptions:

  • each month we invest in n different stocks (depending on the model predictions)
  • we invest the same amount on each stock (1/n)

With those assumptions, we can easily compute the monthly return of the strategy and then calculate financial metrics like total return or Sharpe ratio.

We start by selecting only the stocks for which the model made a positive prediction (buy).

In [15]:
# select only the stocks that were picked by the model
df_buy = df_test.loc[df_test["buy"] == True][["return_month", "target", "buy"]]
df_buy
Out[15]:
return_month target buy
ticker date
AAPL 2020-04-30 0.191886 True True
AMGN 2020-02-29 -0.070369 False True
2020-03-31 0.014157 False True
2020-06-30 0.030451 False True
2020-11-30 0.008922 False True
... ... ... ... ...
WBA 2020-11-30 0.114243 True True
2020-12-31 0.039021 False True
WMT 2020-02-29 -0.062837 False True
2020-03-31 0.060722 True True
2020-07-31 0.083298 True True

140 rows × 3 columns

We then aggregate the data at the month level to get an overview of how many stocks the model picked per month and what the average return was. We can use the mean return per month because we assumed that we invest 1/n in each selected stock.
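That the simple mean equals the portfolio return under 1/n weighting is easy to verify numerically:

```python
import numpy as np

# toy monthly returns of three stocks picked in the same month
returns = np.array([0.05, -0.02, 0.10])

# equal weight of 1/n on each stock
weights = np.full(len(returns), 1 / len(returns))

# the weighted portfolio return coincides with the simple mean
print(weights @ returns, returns.mean())
```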

In [16]:
df_results = (
    df_buy.reset_index()
    .groupby("date")
    .agg({"ticker": "count", "return_month": "mean"})
)

df_results
Out[16]:
ticker return_month
date
2020-01-31 3 0.048522
2020-02-29 9 -0.085939
2020-03-31 24 -0.122494
2020-04-30 27 0.148161
2020-05-31 12 0.064157
2020-06-30 5 0.084158
2020-07-31 9 0.000687
2020-08-31 9 0.095477
2020-09-30 8 -0.038285
2020-10-31 15 -0.044311
2020-11-30 12 0.152646
2020-12-31 7 0.052516

We can use the describe() function to get some statistics.

In [17]:
df_results.describe()
Out[17]:
ticker return_month
count 12.000000 12.000000
mean 11.666667 0.029608
std 7.227892 0.088410
min 3.000000 -0.122494
25% 7.750000 -0.039792
50% 9.000000 0.050519
75% 12.750000 0.086988
max 27.000000 0.152646

The number of stocks picked per month ranges from 3 to 27 and the average return per month is 2.96%.

Let's also compute the Sharpe ratio, a very common metric for assessing the return of an investment relative to its risk.

In [18]:
import numpy as np


def sharpe(s_return: pd.Series, annualize: int, rf: float = 0) -> float:
    """
    Calculate sharpe ratio

    :param s_return: pd.Series with return
    :param annualize: int periods to use for annualization (252 daily, 12 monthly, 4 quarterly)
    :param rf: float risk-free rate
    :return: float sharpe ratio
    """
    # (mean - rf) / std
    sharpe_ratio = (s_return.mean() - rf) / s_return.std()

    # annualize
    sharpe_ratio = sharpe_ratio * np.sqrt(annualize)

    return sharpe_ratio
In [19]:
sharpe_ratio = sharpe(df_results["return_month"], annualize=12)
print(f"Sharpe ratio: {round(sharpe_ratio, 2)}")
Sharpe ratio: 1.16

Sharpe ratio of 1.16, not bad :)

In [ ]:
 

🎨 Visualization¶

To visualize the return over time, we first need to calculate the cumulative return.

In [20]:
# by using the monthly return, we can calculate the cumulative return over the entire year
df_results["return_month_cumulative"] = (df_results["return_month"] + 1).cumprod() - 1

df_results
Out[20]:
ticker return_month return_month_cumulative
date
2020-01-31 3 0.048522 0.048522
2020-02-29 9 -0.085939 -0.041586
2020-03-31 24 -0.122494 -0.158986
2020-04-30 27 0.148161 -0.034381
2020-05-31 12 0.064157 0.027570
2020-06-30 5 0.084158 0.114049
2020-07-31 9 0.000687 0.114814
2020-08-31 9 0.095477 0.221252
2020-09-30 8 -0.038285 0.174496
2020-10-31 15 -0.044311 0.122453
2020-11-30 12 0.152646 0.293791
2020-12-31 7 0.052516 0.361736

We can then make some nice plots using Plotly.

In [21]:
import plotly.express as px

# plot monthly return
fig = px.bar(df_results, y="return_month", title="Monthly return (%)")
fig.show()

# plot cumulative return
fig = px.line(df_results, y="return_month_cumulative", title="Cumulative return")
fig.show()

Those first results don't look bad at all: even with the COVID-19 crisis around March, we end 2020 with a +36% return. This is very promising :)
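Drawdown was listed among the portfolio metrics earlier but never computed; here is a minimal maximum-drawdown sketch (an addition of mine, not part of the original notebook), which could be applied to df_results["return_month"]:

```python
import pandas as pd


def max_drawdown(returns: pd.Series) -> float:
    """Largest peak-to-trough loss of the cumulative return curve."""
    cumulative = (1 + returns).cumprod()
    running_max = cumulative.cummax()
    drawdown = cumulative / running_max - 1
    return drawdown.min()


# toy monthly returns: +10%, -20%, +5% -> worst drop is the -20% month
print(max_drawdown(pd.Series([0.10, -0.20, 0.05])))
```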

However, to have a complete picture, we need to put these results in perspective and compare them with a benchmark strategy. As we used the 30 stocks of the Dow Jones, we will use an ETF that is tracking the same index: DIA.

Let's load the data of this ETF.

In [22]:
# load the historical price DIA (benchmark strategy)
df_benchmark = pd.read_csv("data/prices_DIA.csv")
df_benchmark
Out[22]:
date adjOpen adjClose return_month return_month_cumulative
0 2020-01-31 00:00:00+00:00 274.265829 270.544375 -0.013569 -0.013569
1 2020-02-29 00:00:00+00:00 271.760972 244.532466 -0.100193 -0.112402
2 2020-03-31 00:00:00+00:00 246.597774 211.254631 -0.143323 -0.239615
3 2020-04-30 00:00:00+00:00 202.986762 234.497751 0.155237 -0.121576
4 2020-05-31 00:00:00+00:00 230.988287 245.649163 0.063470 -0.065822
5 2020-06-30 00:00:00+00:00 245.156493 249.831985 0.019071 -0.048006
6 2020-07-31 00:00:00+00:00 250.800814 256.312811 0.021978 -0.027083
7 2020-08-31 00:00:00+00:00 257.592678 276.232473 0.072362 0.043318
8 2020-09-30 00:00:00+00:00 275.698043 270.294625 -0.019599 0.022870
9 2020-10-31 00:00:00+00:00 272.077108 258.283073 -0.050699 -0.028988
10 2020-11-30 00:00:00+00:00 262.024893 289.680622 0.105546 0.073498
11 2020-12-31 00:00:00+00:00 292.717864 299.217200 0.022203 0.097334

We compute the Sharpe ratio for this benchmark strategy.

In [23]:
sharpe_ratio_benchmark = sharpe(df_benchmark["return_month"], annualize=12)
print(f"Sharpe ratio benchmark: {round(sharpe_ratio_benchmark, 2)}")
Sharpe ratio benchmark: 0.45

A Sharpe ratio of 0.45, much lower than that of the ML model!

Let's plot both strategies on the same graph to get a better idea of the difference.

In [24]:
import plotly.graph_objects as go

fig = go.Figure()
fig = fig.add_trace(
    go.Scatter(y=df_results["return_month_cumulative"], name="ML Model"),
)
fig = fig.add_trace(
    go.Scatter(y=df_benchmark["return_month_cumulative"], name="Benchmark")
)
fig.update_layout(
    title="Cumulative Return",
)
fig.show()

The ML model follows the same trajectory as the benchmark strategy, which makes sense given that the set of stocks is limited to only 30 tickers.

However, the model does seem to have been able to distinguish and pick high-performing stocks, leading to more than three times the benchmark's return and a boosted Sharpe ratio. Nice job!

In [ ]:
 

🔚 Conclusion¶

I hope this introductory post was able to give some insights on how to use Machine Learning for Trading. 

Feel free to play around with the notebook, try different algorithms and features, or improve the backtesting approach with weights, stop-loss, take-profit, fees, etc. 
Watch out though: it is very easy to get a "good-looking" backtest in Financial Machine Learning, something financial professionals often stress. In another post, I will cover the dangers of repeated backtesting, show how easy it is to overfit a model, and explain how nested cross-validation can help reduce this risk.
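Nested cross-validation deserves its own post, but a first step beyond a single split is a walk-forward scheme that respects time ordering. A sketch with scikit-learn's TimeSeriesSplit on toy data (the notebook itself uses only one train/test split):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 24 toy monthly observations, oldest first
X = np.arange(24).reshape(-1, 1)

# each fold trains on the past and tests on the following block,
# so future data never leaks into training
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print(f"train 0-{train_idx.max()}, test {test_idx.min()}-{test_idx.max()}")
```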

If you have any questions or remarks, drop a comment, and follow me for more posts like this one!

You can also follow me on the trading platform eToro, where I use Machine Learning at the core of my investment strategy.
